(54) Method and apparatus for generating structured document

(57) A structured document generating method and
apparatus are provided which are capable of easily
generating a structured document matching the document
structure of each non-structured document, by using a
rule directly generated from a preset document structure
definition for the conversion of the non-structured
document into the structured document. A keyword
extracting module (102) extracts a keyword representative
of the document structure from a non-structured document
(101) by using a keyword extracting rule (103), and a
keyword/text model (104) is generated which is described
by two kinds of elements: keywords and other strings. A
parsing module (105), generated by a process (113) of
automatically generating a parsing module by referring
to a parsing rule (111) generated by modifying and
converting DTD (106), performs a parsing process on the
keyword/text model (104) to generate an interim SGML
document (114). An SGML document correcting module
(115) modifies the interim SGML document (114) and
generates a final output of an SGML document by
referring to DTD difference information (109) generated
when the parsing rule was generated.
Description

BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention generally relates to management
of documents having a regular document format such as
legal documents, and particularly to a method and
apparatus for generating a structured document from a
non-structured document. The "non-structured document"
means a document, entered through character recognition,
a word processor, or the like, which does not contain
information explicitly showing the structure of the
document. The "structured document" is a document which
contains information explicitly showing the structure of
the document.

Description of the Related Art 

In a known method of generating a structured doc- 
ument, information explicitly showing the document 
structure is embedded in a text. Generally, a document 
generated by a user (hereinafter called a "document 
instance") often contains a portion for designating a file 
which describes a document structure definition and a 
text content portion. The document structure definition 
defines the document structure and a mark indicating 
an element (the mark is hereinafter called a "tag"). The 
document structure definition is often set in order to effi- 
ciently use a document to be structured. The tag 
defined by the document structure definition is inserted 
into the text content portion in order to explicitly express 
the document structure and uniquely determine a string 
which is an element of the document structure indicated 
by the tag. 

In outputting a document instance structured in the 
above manner, an image to be output is generated by 
referring to a file which describes a layout definition 
defining what format is used for outputting each compo- 
nent (hereinafter called an "element") of the document 
structure. In this method, the document instance and 
the layout definition are independent so that any docu- 
ment instance can be used irrespective of the type of an 
apparatus or system to be used for the output. 

The contents of a string of a structured document
are explicitly expressed by inserting a tag such as
<author name> and <title> which is in one-to-one
correspondence with an element. Therefore, in combination
with a tool such as a full text search system for
structured documents, an aggregation of document
instances themselves can be used as a database, and
the document contents can be added or changed easily.
Even if part of this database is lost by some failure, it is
possible to know that this database has a lost portion,
by comparing the original document structure
definitions with the database of document instances.

Because of these advantages, structured documents
are widely used for document management of a
document processing system which stores and uses a
large number of documents. Along with this, several
approaches have been proposed to convert a
non-structured document, such as existing paper
documents and documents entered by a word processor,
into a structured document.

JP-A-62-249270 and "Method of Converting Document
Image into ODA Structured Document" (Journal of
Papers of The Institute of Electronics, Information and
Communication Engineers, D-II, Vol. J76-D-II, No. 11,
pp. 2274-2284) propose the following method. First, the
field of a document type of a document is restricted.
Next, a structured document is generated by using a
document structure common in the restricted field
(hereinafter called a "common document structure")
and a document structure analysis rule.

With this method, the document structure usable in
common in each field of a document such as "technical
document" and "business document" is set. Then, the
document structure analysis rule is manually generated
in order to analyze a non-structured document and
extract its document structure. By using the document
structure analysis rule, the non-structured document
is converted into a document instance matching
the common document structure. If there is an element
which is specific to each document structure and cannot
be expressed by the common document structure
(hereinafter called an "individual document structure"),
the document instance matching the common document
structure is converted into a document instance
matching the individual document structure.

With this method, however, the document structure
subjected to the document structure analysis and the
document structure analysis rule are dependent upon
the field of a non-structured document. Therefore, in
order to process a document in a different field, the
document structure analysis rule for this field is required
to be newly generated manually. This work requires a
large amount of labor.

This method uses a single document structure
analysis rule considered to have high commonness among
a plurality of document types in a specific field. Therefore,
this single document structure analysis rule is not
always optimum for each document, and an element
specific to an individual document structure cannot be
analyzed directly. In this case, it becomes necessary after
the document structure analysis to convert the
document instance again into another document instance
matching the individual document structure. Specifically,
tags of the first generated document instance are
added, changed, or deleted. This work generally
requires complicated operations and hence a large
amount of labor.

Further, this method gives no consideration to
supporting the generation of a rule for extracting a
keyword. Therefore, an element to serve as a keyword is
required to be manually determined, and the conditions of
layout and string necessary for extracting a keyword are
also required to be manually set.
Still further, this method does not provide means for
supporting the determination of which element is to be a
keyword (hereinafter called a "keyword-corresponding
element"). Elements which contain string data are not
always extracted as keywords. Elements having no
characteristic layout or string are not extracted as
keywords, but are dealt with as a string between
keywords, i.e., a non-keyword.

The restriction condition that "non-keywords should
not be contiguous in a document instance" is imposed
when determining which elements are keyword-corresponding
elements. This is because a non-keyword is a "string
between keywords" and is therefore required always to be
contiguous to a keyword. However, conventional methods
have no means for automatically checking whether an
aggregation of elements determined as
keyword-corresponding elements satisfies the restriction
condition. If the aggregation of these
keyword-corresponding elements does not satisfy the
restriction condition, defects or errors occur when the
rule for document structure analysis is generated or when
the document structure is analyzed. It is then necessary
to determine the keyword-corresponding elements again.
This cycle is required to be repeated until an aggregation
of proper keyword-corresponding elements is set.

Lastly, this method provides no support for setting the
conditions of layout and string necessary for the
extraction of a keyword. It is therefore necessary to
manually collect information necessary for the extraction
of a keyword from the non-structured document itself, or
from rules or the like defining the format of the
non-structured document. This requires a large amount of
labor.

JP-A-6-290173 gives the following description. A 
document structure indicating each element of a 
labeled document is generated by referring to a 
"schema" describing restricting information of the docu- 
ment structure, and then a structured document is gen- 
erated. 

In JP-A-6-290173, however, although use of the 
schema describing restricting information of the docu- 
ment structure is described, how the schema is gener- 
ated is not described. 

SUMMARY OF THE INVENTION 

It is an object of the invention to solve the above 
problems and enable proper document structure analy- 
sis of documents of a plurality of fields. 

It is another object of the invention to directly analyze
elements specific to the individual document structure
and to enable direct generation of a document
instance matching the individual document structure.

It is a further object of the invention to support the
generation of a rule for extracting a keyword.

In order to achieve the above objects, the invention
provides a method of generating a structured document
for a structured document generating apparatus having
at least an input/output device, a control unit, and a
repository, wherein a non-structured document not
explicitly given the document structure and input from
the input/output device is converted into a structured
document explicitly given the document structure, in
accordance with a document structure definition defining
the document structure, the method comprising the
steps of: modifying a given first document structure
definition so as to match the document structure of the
input non-structured document and generate a second
document structure definition; the control unit generating
a parsing rule used for performing a parsing process
suitable for the document structure of the second
document structure definition, by modifying marks
constituting the second document structure definition and
modifying the second document structure definition so
as to make the positional order of the marks in
one-to-one correspondence; in accordance with the
generated parsing rule, generating a first structured
document from the input non-structured document; and in
accordance with difference data between the first document
structure definition and the second document structure
definition, converting the generated first structured
document into a format matching the first document
structure definition to thereby generate a second
structured document.

With the above configuration, conversion from the
non-structured document to the structured document
can be performed, for example, by a parsing module
which analyzes the document structure through parsing
on the basis of extracted keywords. The parsing module
is generated by converting a given document structure
definition into a parsing rule by means of a parsing rule
generating module, and by subjecting this parsing rule
to a process of automatically generating a parsing
module.

In the process of automatically generating a parsing
module, an aggregation of rules such as "A is
constituted by patterns B, C, ..." is input and a program
for executing a parsing process in accordance with these
rules is output. A particular process to be executed when
each rule is satisfied can be described in this program.
Such a process of automatically generating a parsing
module may be yacc, for example.
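
By way of illustration only, the relationship between a rule set and the generated parsing program can be sketched in Python as follows. This is a toy stand-in for a parser generator and not yacc itself; the rule table RULES, the function make_parser, and the token names are hypothetical, and the sketch handles only plain sequences without alternatives or repetition.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical rule table: each key "is constituted by" the listed patterns.
# Symbols that are not keys of the table are treated as terminal tokens.
RULES: Dict[str, List[str]] = {
    "LAW": ["PROMULGATION", "TITLE"],
    "PROMULGATION": ["PROMULGATIONSTATEMENT", "PROMULGATIONDATE"],
}


def make_parser(rules: Dict[str, List[str]],
                on_match: Callable[[str, int, int], None]):
    """Build a parse function from the rule table (toy parser generator)."""

    def parse(symbol: str, tokens: List[str], pos: int) -> Optional[int]:
        if symbol not in rules:
            # Terminal: it must equal the next input token.
            if pos < len(tokens) and tokens[pos] == symbol:
                return pos + 1
            return None
        start = pos
        for part in rules[symbol]:
            nxt = parse(part, tokens, pos)
            if nxt is None:
                return None  # rule not satisfied
            pos = nxt
        # Particular process executed when the rule is satisfied.
        on_match(symbol, start, pos)
        return pos

    return parse


matched = []
parse = make_parser(RULES, lambda name, s, e: matched.append((name, s, e)))
tokens = ["PROMULGATIONSTATEMENT", "PROMULGATIONDATE", "TITLE"]
end = parse("LAW", tokens, 0)
```

After the call, matched holds one entry per satisfied rule together with the token range it covers, which plays the role of the particular process attached to each rule.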

With the above configuration, if the same string in
the same string region is extracted as a plurality of
different keywords, the parsing module of the control unit
selects a proper one from the plurality of keywords in
accordance with whether the parsing process succeeds
or fails.

A method of generating a structured document is
performed in practice as follows. First, a keyword
extraction module extracts a keyword from the
non-structured document, and generates an abstract
keyword/text model which represents the
non-structured document as an aggregation of
elements constituted by keywords and other strings.

The parsing module performs a parsing process
on the keyword/text model to generate the structured
document. The parsing module is generated in the
following procedure. First, a given document structure
definition is modified so as to match the document
structure of the non-structured document, and the
difference therebetween is stored. Next, the parsing rule
generating module converts the modified document
structure definition into a parsing rule. In this case, a
program is embedded in the parsing rule which, when each
rule is satisfied, i.e., when each element is detected,
records information of the detected element in a
corresponding position of the keyword/text model. Then,
the process of automatically generating a parsing module
generates the parsing module which realizes the parsing
process described in the parsing rule.

The parsing module generated in the above man- 
ner performs a parsing process relative to the key- 
word/text model generated by the keyword extracting 
module, and generates an interim structured document 
matching the modified document structure definition, in 
accordance with the parsing results recorded in the
keyword/text model. A structured document correcting
module refers to the difference stored when the
document structure definition was modified, and outputs a
structured document matching the document structure
definition before modification.

A given layout definition and a second document 
structure definition support the generation of a keyword 
extraction rule used for extracting a keyword. The sec- 
ond document structure definition is generated by mod- 
ifying a preset document structure definition so as to 
match the document structure of the input non-struc- 
tured document. 

Specifically, the keyword extracting module com- 
prises: means for extracting layout information from the 
given layout definition, the layout information including 
information about layout and string used when each ele- 
ment of the document structure is output; means for 
extracting information of connection between elements 
from the second document structure definition; means 
for supporting a determination by a user of which ele- 
ment is extracted as the keyword, by using the informa- 
tion of connection between elements; and means for a 
user to edit layout information extracted from the layout 
definition so as to match the layout of the non-structured 
document. 

The means for editing layout information com- 
prises: means for notifying the layout information 
extracted for each element of the document structure to 
the user, the layout information being provided for each 
item necessary for extracting a keyword; and means for 
the user to modify the notified layout information so as 
to match the layout of the non-structured document or to 
supplement missing information. 

With the above structure, the document structure
and the rule for analyzing the document structure are
generated by modifying the document structure
definition preset for each document. Therefore, labor
required for the design of the document structure for
document structure analysis and required for generating
the rule can be reduced. Since the parsing rule
dynamically generated in accordance with the document
structure definition of each document is used, it is
possible to directly generate the structured document
matching the individual document structure without using
the common document structure, and it is not necessary to
convert the structured document from the format matching
the common document structure into the format matching
the individual document structure.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram illustrating the operation
outline of a structured document generating system
according to an embodiment of the invention.

Fig. 2 is a diagram showing an example of a
non-structured document.

Fig. 3 is a diagram showing part of DTD which is a
document type definition of an SGML format set for the
document shown in Fig. 2.

Fig. 4 is a tree diagram showing part of DTD shown
in Fig. 3.

Fig. 5 shows an example of a keyword extraction
rule in part.

Fig. 6 is a diagram explaining a description
constituent of the format condition of the keyword
extraction rule shown in Fig. 5.

Fig. 7 shows an example of extracted keywords.

Fig. 8 shows an example of a keyword/text model.

Fig. 9 is a block diagram illustrating the operation
outline of a parsing rule generating module.

Fig. 10 shows an example of a modified DTD in part.

Fig. 11 shows an example of DTD difference data.

Fig. 12 shows conversion rules to be referred to
when the parsing rule generating module converts DTD
into a yacc rule.

Fig. 13 shows an example of an interim yacc rule in part.

Fig. 14 shows an example of a parsing rule in part.

Fig. 15 shows an example of an interim SGML
document in part.

Fig. 16 illustrates an example of a process by an
SGML document correcting module.

Fig. 17 shows an example of an SGML document
finally generated by the embodiment method.

Fig. 18 is a block diagram showing the hardware
structure of the structured document generation system
of the first embodiment.

Fig. 19 is a diagram illustrating the process outline
to be executed by the parsing module.

Fig. 20 shows an example of a keyword/text model
with tag information being given.

Fig. 21 is a block diagram illustrating the process
outline to be executed by a keyword extraction rule
generating system according to a second embodiment of
the invention.

Fig. 22 shows an example of extraction of
string-corresponding elements.
Fig. 23 shows an example of the modified DTD
shown in Fig. 10 described in BNF notation.

Fig. 24 is a diagram illustrating the procedure of
obtaining string-corresponding elements capable of
appearing at the start of each element.

Fig. 25 shows string-corresponding elements capable
of appearing at the start and end of each element in
the modified DTD described in BNF notation shown in
Fig. 23.

Fig. 26 is a diagram showing the contiguity
relationship between string-corresponding elements in the
modified DTD described in BNF notation shown in Fig. 23.

Fig. 27 shows an example of string-corresponding
element information.

Fig. 28 shows an example of layout information.

Fig. 29 shows an example of required items
necessary for extracting a keyword.

Fig. 30 shows an example of the process of
extracting a required item from the layout definition.

Fig. 31 is a diagram showing an example of an
interface of a keyword information indicating module.

Fig. 32 is a flow chart illustrating the processes to
be executed by the keyword information indicating
module.

Fig. 33 is a diagram showing an interface of a
supplementary information editing module.

Fig. 34 is a flow chart illustrating the processes to
be executed by the supplementary information editing
module.

Fig. 35 is a flow chart illustrating the process of
generating a format condition.

Fig. 36 is a flow chart illustrating the processes to
be executed by a contiguous element checking module.

Fig. 37 is a diagram showing an example of the
results processed by the contiguous element checking
module.

Fig. 38 is a block diagram showing the hardware
structure of the keyword extraction rule generating
system of the second embodiment.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the invention will be described with
reference to the accompanying drawings. In this
embodiment, a structured document generating module
analyzes a document structure through parsing. As the
structured document format, an SGML (Standard
Generalized Markup Language) format is adopted, and as
the document structure definition, DTD (Document Type
Definition) of an SGML document type definition is
used. The process contents and description rules of
SGML and DTD are stipulated in ISO (International
Organization for Standardization) standard ISO 8879.
The details thereof are explained in "SGML: An Author's
Guide to the Standard Generalized Markup Language",
by Martin Bryan, Addison-Wesley, Publishers, 1988. In
this embodiment, yacc is used in a process of
automatically generating a parsing module. C language is
used for describing a process to be added when each rule
to be inputted to yacc is satisfied. The details of a yacc
process are explained in a document "How to Use yacc
and lex" by Takashi SAITHO, HBJ publishing division,
and the C language is explained in a document
"Programming Language C" by B. W. Kernighan and D. M.
Ritchie, Kyoritsu Publishing Company.

First, the outline of the first embodiment will be
described. Fig. 18 is a diagram showing the hardware
structure of a structured document generating system
of the first embodiment. An input/display device 1
receives an input entered by a user and displays an
input non-structured document, a generated structured
document, or the like. The input/display device 1 is
constituted by a display, a keyboard, a mouse, or the like.
An external repository unit 2 stores a variety of data for
structured document generation. This unit 2 is realized
by a hard disk or the like and constituted by a
non-structured document repository 21, a structured
document generating rule repository 22, and a structured
document repository 23. A control unit 3 controls each device
constituting the system, processes information for struc- 
tured document generation, and is constituted by a con- 
troller 31, an internal memory 32, and a structured 
document generating unit 33. The controller 31 reads 
data stored in the non-structured document repository 
21 and structured document generating rule repository 
22, develops it on the internal memory 32, executes 
processes of the structured document generating unit 
33 on the internal memory 32 by using the developed 
data, and stores the generated structured document in 
the structured document repository 23. The processes 
to be executed include a process 34 of generating a 
parsing module and a process 35 of generating a struc- 
tured document. The parsing module generating proc- 
ess 34 constitutes part of the structured document 
generating process 35. The structured document gener- 
ating process 35 is a process of converting a non-struc- 
tured document stored in the non-structured document 
repository 21 into a structured document by using a 
document structure definition, a keyword extraction rule, 
a rule conversion regulation, and the like stored in the
structured document generating rule repository 22.
The parsing module generating process 34 and the 
structured document generating process 35 can be 
described by known programming languages. 

Next, the outline of processes of the first embodi- 
ment will be described. 

Fig. 1 is a block diagram showing a flow of the 
structured document generating process of the struc- 
tured document generating system of the embodiment. 
A non-structured document 101 is electronic document 
information of sequential character strings generated by 
a word processor, a character recognition apparatus, or 
the like, and is input to the system from the input/display 
device 1. A keyword extraction module 102 extracts a
keyword from the non-structured document in accordance
with a keyword extraction rule 103. A keyword is a
character string expressing a document structure of the
non-structured document 101. The keyword extraction
module 102 then separates the non-structured document
101 into keywords and other strings and generates
an abstract keyword/text model 104 as an
aggregation of these elements of keywords and other
strings. A parsing module 105 performs a parsing process
described in a parsing rule 111 to analyze the document
structure, the parsing rule 111 having been
generated by a parsing rule generating module 110.

The outline of a method of generating the parsing
module 105 is as follows. First, a DTD correcting
module 107 modifies a DTD 106 to generate a modified DTD
108 so as to match the description format of the
non-structured document 101, and stores difference
information as DTD difference data 109. DTD 106 is a
prepared standard document type definition and does not
necessarily match the input non-structured document 101.
This modification is therefore performed in accordance
with a comparison result by a system user between the
non-structured document 101 and DTD 106. The parsing
rule generating module 110 refers to a rule conversion
regulation 112 and generates the parsing rule 111
from the modified DTD 108. Then, yacc 113, which is
the process of generating a parsing module of this
embodiment, generates the parsing module 105 in
accordance with the parsing rule 111, the parsing
module 105 realizing a parsing process described by the
parsing rule 111.

The parsing module 105 performs a parsing proc- 
ess for the keyword/text model 104, and affixes a tag 
representative of the document structure to generate an 
interim SGML document 114. This document is a docu- 
ment instance formed in conformity with the modified 
DTD 108. Therefore, by referring to the DTD difference 
data 109, an SGML document correcting module 115 
modifies the interim SGML document 114 to generate 
an SGML document 116 matching DTD 106. 
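
By way of illustration only, the correction step can be sketched in Python as follows. The sketch assumes, hypothetically, that the DTD difference data is simply a list of element names that were added to the DTD during modification, and that correction consists of deleting those elements together with their content; the actual format of the difference data 109 and the operations of the correcting module 115 may differ. The function name correct_interim and the sample document are likewise assumptions.

```python
import re
from typing import List


def correct_interim(interim: str, added_elements: List[str]) -> str:
    """Delete every element that exists only in the modified DTD.

    Assumption: an added element is removed with its tags and content.
    """
    result = interim
    for name in added_elements:
        tag = re.escape(name)
        # Non-greedy match of one element, plus trailing whitespace.
        pattern = re.compile(r"<{0}>.*?</{0}>\s*".format(tag), re.DOTALL)
        result = pattern.sub("", result)
    return result


# Hypothetical interim document conforming to a modified DTD.
interim_doc = (
    "<LAW>\n"
    "<OPENINGTITLE>FLOOD DEFENCE SIGNAL REGULATION</OPENINGTITLE>\n"
    "<TITLE>FLOOD DEFENCE SIGNAL REGULATION</TITLE>\n"
    "</LAW>"
)
difference_data = ["OPENINGTITLE"]  # hypothetical difference data
final_doc = correct_interim(interim_doc, difference_data)
```

Keeping the difference as data separate from the parsing rule allows the same correction routine to be reused whenever the DTD is modified differently for another document.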

Each process of the embodiment will be detailed 
next. 

Fig. 2 shows an example of the non-structured
document 101 shown in Fig. 1. This document is obtained
from an existing paper document regarding a
law through character recognition. Although there is no
explicit description showing the document structure, each
component of this document is laid out to be easy to read,
using spaces or the like. In order for the document
processing system to utilize such a text type electronic
document, a document type definition (DTD) is set. Fig.
3 shows an example of DTD for the non-structured
document shown in Fig. 2. The first line (line number 1;
other lines are likewise referred to by line number)
indicates that the document structure definition
has a name of "LAW". Second to seventeenth lines
indicate definitions of elements. The name of an element is
described after "ELEMENT", and after this a model
group is described between "(" and ")". The model
group is an aggregation of constituents which form the
element. These constituents are one or more elements
and content tokens representative of data such as
"#PCDATA"; nested model groups may also be used as
constituents. The second line indicates that the element
"LAW" is constituted by a series of elements of
"PROMULGATION", "ESTABLISHEDREGULATIONNO.", "TITLE",
and "PRESENTREGULATION". The third line indicates that
the element "PROMULGATION" is constituted by a series
of elements of "PROMULGATIONSTATEMENT",
"PROMULGATIONDATE", and "PROMULGATIONOFFICER". The
eleventh line indicates that the element
"PRESENTREGULATION" is constituted by one or
more "ARTICLE" elements. An element affixed with a plus
sign "+", such as "ARTICLE", may appear one or more
times. An element affixed with an asterisk "*" may
appear any number of times, including zero. The content
token "#PCDATA" at the fourth, fifth, and seventh to tenth
lines means that the corresponding elements
"PROMULGATIONSTATEMENT", "PROMULGATIONDATE",
"OFFICIALTITLE", "NAME", "ESTABLISHEDREGULATIONNO.",
and "TITLE" each have a string indicating the
contents of the element. The document structure is shown
as a tree diagram in Fig. 4.
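
By way of illustration only, element definitions of this kind can be read into a tree-like table with the following Python sketch. The DTD fragment DTD_TEXT is a hypothetical, simplified excerpt assembled from the element names mentioned above (it is not the DTD of Fig. 3), and the function parse_dtd handles only flat model groups without nesting.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical, simplified DTD fragment (not the actual content of Fig. 3).
DTD_TEXT = """
<!ELEMENT LAW (PROMULGATION, TITLE, PRESENTREGULATION)>
<!ELEMENT PROMULGATION (PROMULGATIONSTATEMENT, PROMULGATIONDATE)>
<!ELEMENT PRESENTREGULATION (ARTICLE+)>
<!ELEMENT TITLE (#PCDATA)>
"""

# Element name after "ELEMENT", then a flat model group between "(" and ")".
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\S+)\s+\(([^()]*)\)\s*>")


def parse_dtd(text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each element name to (constituent, occurrence indicator) pairs."""
    model: Dict[str, List[Tuple[str, str]]] = {}
    for name, group in ELEMENT_RE.findall(text):
        parts: List[Tuple[str, str]] = []
        for token in group.split(","):
            token = token.strip()
            if not token:
                continue
            # "+" means one or more, "*" zero or more, "?" optional.
            occurrence = token[-1] if token[-1] in "+*?" else ""
            parts.append((token.rstrip("+*?"), occurrence))
        model[name] = parts
    return model


structure = parse_dtd(DTD_TEXT)
```

Walking structure from "LAW" downward yields a tree of the kind shown in Fig. 4, with "#PCDATA" entries at the leaves.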

In this system, the document structure of a
non-structured document such as shown in Fig. 2 is
analyzed by directly using DTD such as shown in Fig. 3 to
generate a structured document which matches DTD.

The keyword extraction module 102 shown in Fig. 1
refers to the keyword extraction rule 103 to extract a
keyword from the non-structured document 101 and
generate the keyword/text model 104. An example of
the keyword extraction rule 103 is shown in Fig. 5. This
rule is an aggregation of combinations of the name of an
element to be extracted as the keyword and a layout
condition which describes information about layout and
string used for the extraction. In Fig. 5, the first item at
each line is the name of a keyword, and the second and
following items are the layout conditions. Fig. 6 gives an
explanation of a description constituent of the layout
condition shown in Fig. 5. For example, the first line
shown in Fig. 5 means that the format conditions of the
keyword "OPENINGTITLE" are that a character "O" is
at the three-space position from the line head, a
string of arbitrary length follows, and the line ends with a
string "LAW" or "REGULATION". The fourth line means
that the format conditions of the keyword
"PROMULGATIONDATE" are that a string "SHOWA" or "TAISHO"
is preceded by any number of spaces from the line head,
and is followed by INTEGER -> "YEAR" -> INTEGER ->
"MONTH" -> INTEGER -> "DAY" in this order to end the
line.
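
By way of illustration only, the format conditions of the keyword "PROMULGATIONDATE" can be approximated by a regular expression, as in the following Python sketch. The notation of Fig. 5 is not reproduced here; the pattern PROMULGATION_DATE, the function extract_keywords, the spacing allowed between items, and the sample lines are all assumptions.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Assumed reading of the rule: optional leading spaces, an era name,
# then INTEGER "YEAR" INTEGER "MONTH" INTEGER "DAY" up to the line end.
PROMULGATION_DATE = re.compile(
    r"^\s*(?:SHOWA|TAISHO)\s*\d+\s*YEAR\s*\d+\s*MONTH\s*\d+\s*DAY\s*$"
)


def extract_keywords(lines: List[str],
                     rules: Dict[str, Pattern[str]]) -> List[Tuple[int, str, str]]:
    """Return (line number, keyword name, string) for each matching line."""
    found = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in rules.items():
            if pattern.match(line):
                found.append((number, name, line.strip()))
    return found


rules = {"PROMULGATIONDATE": PROMULGATION_DATE}
sample = [
    "   SHOWA 30 YEAR 4 MONTH 1 DAY",   # hypothetical document line
    "This line is ordinary text.",
]
keywords = extract_keywords(sample, rules)
```

Because every rule is tried against every line, a single line may be reported under more than one keyword name when several rules match it.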

The keyword extraction module 102 shown in Fig. 1
checks whether the electronic document contains a string
which matches the format conditions of the
keyword extraction rule. If there is a matching string, it is
extracted as the keyword (an example of an extracted
keyword is shown in Fig. 7). Thereafter, the document is
separated into keywords and other strings to generate
the abstract keyword/text model 104 which is an
aggregation of keywords and other strings. Specifically, if
there is a string which is not a keyword between
keywords, it is considered to be a "text" string other than
keywords, and a keyword/text model such as shown in
Fig. 8 is configured. The keyword/text model shown in
Fig. 8 starts from the keyword "OPENINGTITLE",
followed by a keyword "PROMULGATIONDATE" -> a
keyword "ESTABLISHEDREGULATIONNO." -> a keyword
"PROMULGATIONSTATEMENT" -> a keyword "TITLE"
-> a keyword "ARTICLENO.". Since a string which is not
a keyword is sandwiched between the keyword
"ARTICLENO." and the next keyword "PARAGRAPHNO.",
this string is considered as a text.
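
By way of illustration only, the separation into keywords and text can be sketched in Python as follows. The sketch assumes, as a simplification, that each keyword occupies a whole line and that the extraction result is given as a mapping from line index to keyword name; the function build_model, the label "TEXT", and the sample lines are hypothetical.

```python
from typing import Dict, List, Tuple


def build_model(lines: List[str],
                keyword_at: Dict[int, str]) -> List[Tuple[str, str]]:
    """Return (kind, content) pairs forming a keyword/text model.

    Consecutive non-keyword lines are merged into one "TEXT" entry, so two
    text entries are never adjacent.
    """
    model: List[Tuple[str, str]] = []
    buffer: List[str] = []
    for index, line in enumerate(lines):
        if index in keyword_at:
            if buffer:
                model.append(("TEXT", "\n".join(buffer)))
                buffer = []
            model.append((keyword_at[index], line.strip()))
        else:
            buffer.append(line.strip())
    if buffer:
        model.append(("TEXT", "\n".join(buffer)))
    return model


lines = [
    "ARTICLE 1",                           # hypothetical document lines
    "Signals shall be given as follows,",
    "in the manner described below.",
    "2",
]
keyword_at = {0: "ARTICLENO.", 3: "PARAGRAPHNO."}  # hypothetical result
model = build_model(lines, keyword_at)
```

The resulting list alternates keyword entries and at most one text entry between them, which is the abstract form handed to the parsing module.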

There is a case wherein the same string in the
same region of the document is extracted as a plurality
of keywords. For example, in the example of the
extracted keywords shown in Fig. 7, the string
"OAAPREFECTUREFLOODDEFENCESIGNALREGULATION"
at the first and second lines is
extracted as a keyword under both of the keyword names
"OPENINGTITLE" and "TITLE". In such a case, it is
assumed that the keywords are extracted from the
same region, and a plurality of keyword/text models
corresponding to each keyword are generated. The
keyword/text model shown in Fig. 8 is formed by selecting
the "OPENINGTITLE" from the keyword names
"OPENINGTITLE" and "TITLE" conflicting in the region. Of the
plurality of keyword/text models, a model which the
parsing module 105 fails to parse is determined to be an
improper keyword/text model. If there is a plurality of
keyword/text models which have succeeded in the parsing,
an optimum one is selected in accordance with a criterion
such as the number of extracted keywords, so that a
single SGML document is eventually generated from the
optimum keyword/text model.
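
The selection among conflicting keyword/text models can be sketched as follows. This is a minimal sketch under stated assumptions: the region representation, the function names, and the `parses` callback (a stand-in for the parsing module 105) are hypothetical, and "largest number of extracted keywords" is used as the selection criterion because it is the example criterion named in the text.

```python
from itertools import product

def candidate_models(regions):
    """regions: list of (string, [possible token names]).

    Yields one keyword/text model per combination of names.
    """
    choices = [[(name, s) for name in names] for s, names in regions]
    for combo in product(*choices):
        yield list(combo)

def select_model(regions, parses):
    """Return the successfully parsed model with the most keywords, or None."""
    ok = [m for m in candidate_models(regions) if parses(m)]
    if not ok:
        return None
    return max(ok, key=lambda m: sum(1 for name, _ in m if name != "TEXT"))
```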

The parsing module 105 shown in Fig. 1 performs a 
parsing process for the keyword/text model 104 in 
accordance with the parsing rule 111. First, the proc- 
esses of modifying DTD 106 by the DTD correcting 
module 107 and generating the parsing rule 111 will be 
described with reference to Fig. 9. 

First, the DTD correcting module 107 manually
generates a modified DTD 108 by modifying the
description contents of DTD 106 set for the non-structured
document so as to match the description format of
the non-structured document, and stores the difference
as the DTD difference data 109. The reason why such
correction becomes necessary is that there may be a
contradiction of the description items and order
between the non-structured document 101 and DTD
106 used for this system. For example, although DTD
106 shown in Fig. 3 is prepared for the non-structured
document 101 shown in Fig. 2, the element for the
opening title "OAA PREFECTURE FLOOD DEFENCE
SIGNAL REGULATION" at the first line shown in Fig. 2
is not given in DTD 106 shown in Fig. 3. In DTD 106
shown in Fig. 3, elements are disposed in the order of
"PROMULGATIONSTATEMENT -> PROMULGATIONDATE
-> ESTABLISHEDREGULATIONNO. -> TITLE",
whereas in the non-structured document shown in Fig.
2, the elements are disposed in the order of
"PROMULGATIONDATE -> ESTABLISHEDREGULATIONNO. ->
PROMULGATIONSTATEMENT -> TITLE".

In order to eliminate such contradiction, the modified
DTD 108 shown in Fig. 10 is manually generated.
The meshed portion in Fig. 10 shows the modified
elements. In order to explicitly indicate the modified
portion, this portion is enclosed by an element <CHANGE>.
The modified portion of the original DTD 106 is stored
as the DTD difference data 109 such as shown in Fig.
11. Also in this case, the modified portion is enclosed by
the element <CHANGE>.

If there is no contradiction of the document structure
between the non-structured document and DTD
106, it is not necessary to generate the modified DTD
108 and DTD difference data 109.

After DTD 106 is modified where necessary, the
parsing rule generating module 110 executes a rule
conversion process 906 in accordance with the rule
conversion regulation 112 shown in Fig. 12 to convert
the element definition described in the modified DTD
108 into an interim yacc rule 908. Each interim rule
(hereinafter called a "production rule") is constituted
by right and left sides partitioned by a colon ":"
such as "A : B C;". If the pattern described at the
right side is present, the rule is satisfied and the
element at the left side is configured. In this example of the
production rule of "A : B C;", an element A is generated
if a pattern "B C" is present.

In DTD, the production rule having the right side of
"#PCDATA" means that the left side element corresponds
directly to the string of the document structure
analysis result. In converting the production rule into the
interim yacc rule, if the left side element is an element
extracted as a keyword in accordance with the keyword
extraction rule shown in Fig. 5, then #PCDATA is
converted into [#KEY "(KEYWORDNAME)"]. #PCDATA in
the other production rules is converted into "#TEXT",
meaning a string other than the keyword. For example,
the production rule converted into [OPENINGTITLE :
#KEY "OPENINGTITLE"] indicates that the keyword
"OPENINGTITLE" corresponds to the element
"OPENINGTITLE". The production rule converted into
[ARTICLESTATEMENT : #TEXT] indicates that a string other
than the keyword corresponds to the element
"ARTICLESTATEMENT".
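
The conversion of "#PCDATA" described above can be sketched as follows. This is an illustrative sketch only: the function name and the exact textual form of the emitted rule (including the terminating ";", taken from the "A : B C;" example) are assumptions and are not taken from Fig. 12 or Fig. 13.

```python
def convert_pcdata_rule(element, keyword_elements):
    """Convert the rule "element : #PCDATA" into an interim rule string.

    Elements extracted as keywords become #KEY rules; all other
    string elements become #TEXT rules.
    """
    if element in keyword_elements:
        return f'{element} : #KEY "{element}";'
    return f"{element} : #TEXT;"
```

For example, with "OPENINGTITLE" among the keyword elements, the sketch yields a #KEY rule for "OPENINGTITLE" and a #TEXT rule for "ARTICLESTATEMENT".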

Fig. 13 shows an example of the yacc rule converted
from the modified DTD shown in Fig. 10. For
example, the definition at the fifth line shown in Fig. 10
is converted into the production rules at the fourth and fifth
lines shown in Fig. 13. In this case, the
"PROMULGATIONSTATEMENT?" shown in Fig. 10 is converted into
"opt0" at the fourth line shown in Fig. 13 in accordance
with the rule at the second line from the bottom shown in
Fig. 12. The definition of "opt0" is described at the fifth
line of Fig. 13.

If such an interim yacc rule is used, the parsing
module generated by yacc outputs only a success/failure
of parsing and does not output the correspondence
between the keyword/text model and elements. However,
in order to generate the structured document by
using the results of parsing, it becomes necessary,
when each element analysis succeeds, i.e., when each
interim rule is satisfied, to add, to the keyword/text
model, information (hereinafter called "tag information")
indicating which element corresponds to each constituent
of the keyword/text model. To this end, the parsing
rule generating module 110 executes a C language program
embedding process 909 for the interim yacc rule
908 in order to add the tag information to the
keyword/text model and generate the parsing rule 111. An
example of the parsing rule 910 is shown in Fig. 14. The
meshed portions illustrate the process of the embedded
C language programs. In this process, pieces of tag
information corresponding to the right side elements of
the production rule are coupled and the tag information
corresponding to the left side element of the production
rule is generated.

Referring back to Fig. 1, yacc 113 receives the gen- 
erated parsing rule 111 and generates a parsing mod- 
ule 105 which performs a parsing process in 
accordance with the parsing rule 111. Manual operation 
required during the process of generating the parsing 
module 105 from DTD 106 is only the operation of 
changing the document structure definition so as to 
match the description format of the non-structured
document and generating the DTD difference data 109. The
other operations are automatically performed.

The parsing module 105 analyzes the document 
structure for the keyword/text model 104 to verify 
whether the keyword/text model 104 matches the pars- 
ing rule 111, and adds the tag information representa- 
tive of the document structure detected during this 
process to the keyword/text model 104. The interim 
SGML document 114 is generated from the key- 
word/text model added with the tag information. 

Keywords and texts (hereinafter collectively called a 
"token") of the keyword/text model both correspond to 
"#PCDATA" in DTD of the tree diagram shown in Fig. 4, 
i.e., to the string representing the contents of each ele- 
ment. The keyword is a string in one-to-one correspond- 
ence with each element, whereas the text is a string 
having no correspondence with each element yet. The 
parsing process corresponds to generating the tree
structure shown in Fig. 4 from the one-dimensional
arrangement of keywords and texts, i.e., the keyword/text
model.

The outline of this process by the parsing module
105 is illustrated in Fig. 19. The parsing module 105
generated by yacc 113 is constituted by a state transition
table 2004 and a parser 2003 which performs the
parsing process while referring to the state transition
table 2004. The state transition table 2004 describes the
tokens acceptable in each state of parsing and the state
to which parsing changes when a token is accepted.
The parser 2003 sequentially reads tokens, which are the
constituents of the keyword/text model 2001, starting from
the opening token (2005). If it is judged in a certain state
that the input token cannot be accepted, it is judged that
parsing has failed (2006 -> 2007). Conversely, if the token
is acceptable, the state of parsing advances one step in
accordance with the state transition table (2006 -> 2008).
In this state, if any one of the production rules of the
parsing rule 111 can be satisfied, the tag information
corresponding to the production rule is added to the
keyword/text model 2001 (2009 -> 2010: this process is
realized by the inserted programs shown in Fig. 14).
Specifically, if a single token corresponds to a certain
element, start-tag information and end-tag information
representative of the name of the element are added to the
token as a pre-tag and a post-tag. For an element
corresponding to a plurality of tokens, the start-tag
information and end-tag information are added to the start
and end tokens, respectively. The addition of tag
information will be detailed later.

When the last token is input and the parsing
changes to the state of "normal termination", it is judged
that the document structure analysis of the keyword/text
model has succeeded.
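
The accept/fail logic driven by the state transition table can be sketched as follows. This is a deliberately simplified, finite-state sketch and not the parser that yacc generates (which also maintains a stack and performs reductions); the state names, the table contents, and the function name are hypothetical.

```python
def parse_succeeds(token_kinds, table, start="S0", final="END"):
    """Read tokens in order; fail as soon as a token is not acceptable."""
    state = start
    for kind in token_kinds:
        next_state = table.get((state, kind))
        if next_state is None:      # token not acceptable in this state
            return False
        state = next_state          # advance one step
    return state == final           # "normal termination" reached?

# Hypothetical table: (current state, accepted token) -> next state.
TABLE = {
    ("S0", "OPENINGTITLE"): "S1",
    ("S1", "PROMULGATIONDATE"): "S2",
    ("S2", "TEXT"): "END",
}
```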

The process when a production rule is satisfied during
the parsing will be detailed with reference to the
keyword/text model shown in Fig. 8 and the rule shown in
Fig. 13. This process realizes the following two functions.

(1) To what element a keyword or text corresponds
is determined. For example, if the keyword
"ARTICLENO." at the sixth line of the keyword/text model
shown in Fig. 8 is input, the production rule at the
thirteenth line of Fig. 13 is satisfied (which production
rule is satisfied in a certain state is described in
the state transition table 2004), and the keyword
"ARTICLENO." corresponds to the element
"ARTICLENO.". In this case, the start-tag information and
end-tag information of the "ARTICLENO." are
added to the pre-tag and post-tag of the keyword
"ARTICLENO." of the keyword/text model (seventeenth
and eighteenth lines in Fig. 20). Next, when
the text at the seventh line of Fig. 8 is input, the
production rule at the fourteenth line of Fig. 13 is
satisfied so that this text is considered to correspond to
the element "ARTICLESTATEMENT". The start-tag
information and end-tag information of the
"ARTICLESTATEMENT" are added to the pre-tag and
post-tag of the text (twenty-first and twenty-second
lines in Fig. 20).

(2) Adjacent elements are summarized to a more
abstract element.

For example, in Fig. 4, the adjacent elements
"PARAGRAPHNO." and "PARAGRAPHSTATEMENT" are
summarized to a more abstract "PARAGRAPH". In the
example of the keyword/text model shown in Fig. 8, the
adjacent "PARAGRAPHNO." and the text (corresponding
to "PARAGRAPHSTATEMENT") at the eighth and
ninth lines are summarized to the one element
"PARAGRAPH" in accordance with the production rule at the
sixteenth line of Fig. 13. If this production rule is
satisfied, the start-tag information of "PARAGRAPH" is
added to the keyword "PARAGRAPHNO." at the eighth
line of Fig. 8, and the end-tag information is added to
the text at the ninth line (twenty-fourth and twenty-eighth
lines in Fig. 20). The same operation is performed for
the combinations of tenth and eleventh lines, twelfth and
thirteenth lines, and fourteenth and fifteenth lines in Fig.
8.
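
The addition of pre-tags and post-tags described in functions (1) and (2) can be sketched as follows. This is an illustrative sketch, not the embedded C language programs of Fig. 14: the Token class and the functions tag_span and render are hypothetical names. Because inner elements are tagged before the element that summarizes them, the start tag of the outer element is placed before the existing pre-tags and its end tag after the existing post-tags.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    kind: str                       # keyword name, or "TEXT"
    string: str
    pre_tags: list = field(default_factory=list)
    post_tags: list = field(default_factory=list)

def tag_span(tokens, first, last, element):
    """Attach start-tag info to the first token and end-tag info to the last."""
    tokens[first].pre_tags.insert(0, f"<{element}>")
    tokens[last].post_tags.append(f"</{element}>")

def render(tokens):
    """Emit each token's string surrounded by its pre-tags and post-tags."""
    return "".join(
        "".join(t.pre_tags) + t.string + "".join(t.post_tags) for t in tokens
    )
```

Tagging "PARAGRAPHNO." and the following text individually and then tagging the pair as "PARAGRAPH" gives a properly nested result when rendered.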

The adjacent "ARTICLENO." (sixth line) and
"ARTICLESTATEMENT" (seventh line) and a plurality of
"PARAGRAPH" elements (eighth to fifteenth lines) can be
summarized to the element "ARTICLE" in accordance with
the production rules at the twelfth and fifteenth lines in
Fig. 13. In this case, the start-tag information of
"ARTICLE" is added to the pre-tag of the keyword
"ARTICLENO." at the sixth line, and the end-tag information is
added to the post-tag of the text at the fifteenth line (in
Fig. 20, only the addition of the start-tag information of
"ARTICLE" is illustrated at the seventeenth line).

If elements whose constituents include keywords
representing a number, such as "ARTICLE" and
"PARAGRAPH" (in this case, "ARTICLENO."
and "PARAGRAPHNO."), are summarized, the first number
and the continuity between numbers are checked. Namely, it is
checked whether the number begins with "1" and
thereafter the numbers 1, 2, 3,... are continuous.

The above process is sequentially performed for each
input token of the keyword/text model 104. If the tree
structure shown in Fig. 4 having one root (in the example
shown in Fig. 4, "LAW") can be obtained, it is judged
that the keyword/text model 104 matches the parsing
rule 111 and the parsing has succeeded. Conversely, if
a token input in a certain state during the parsing is not
acceptable, i.e., if the keyword/text model 104 does not
match the parsing rule 111, it is judged that the parsing
has failed. If, in the continuity check of numbers of the
function (2) described above, the first number is abnormal
or the continuity between numbers is not retained, it
is judged that the document structure analysis has
failed. Examples of such cases are that the numbers start
from the number 3 instead of the number 1, or that
numbers are skipped as in 1, 2, and 5.
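
The first-number and continuity check can be sketched as follows. This is a minimal sketch; the function name is hypothetical, and the numbers are assumed to have already been extracted as integers from the "ARTICLENO." or "PARAGRAPHNO." keywords.

```python
def numbers_are_continuous(numbers):
    """True if the sequence is exactly 1, 2, 3, ... with no gaps."""
    return list(numbers) == list(range(1, len(numbers) + 1))
```

In this sketch an empty sequence is treated as valid; whether that is appropriate depends on whether the element may be absent.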

If the parsing has succeeded, the parsing module
105 outputs the interim SGML document 114 in
accordance with the tag information given to the keyword/text
model 104. Specifically, the output interim SGML
document 114 has tags corresponding to the start-tag
information and end-tag information added to the front
and back of a string corresponding to each token of the
keyword/text model 104. An example of the interim
SGML document 114 is shown in Fig. 15.

As seen from this example, the tag information
includes the start-tag information and end-tag information,
and the end-tag information is not always positioned
near the start-tag information. For example,
although the end-tag information </ARTICLENO.> for
the start-tag information <ARTICLENO.> is just two lines
below, the end-tag information </ARTICLE> for the
start-tag information <ARTICLE> is far below the drawing
space. Therefore, if the document structure is to be
manually modified when the interim SGML document is
generated, it is required to search for the corresponding
start-tag information and end-tag information over the
whole of the document, requiring a large amount of
labor. In this embodiment, necessary modification is
completed at the stage of DTD so that the generated
interim SGML document 114 matches the input
non-structured document 101 and the modification
described above is not necessary.

If a plurality of keywords are extracted from the
same region, a plurality of keyword/text models are
generated. In this case, the parsing process is performed
for all the keyword/text models. If an erroneous keyword
is contained, the parsing fails. If there are a plurality of
keyword/text models which have succeeded in the parsing,
an optimum keyword/text model is selected in
accordance with, for example, the condition that the
number of extracted keywords is large, and a corresponding
interim SGML document is output. This will be
described by using the example shown in Fig. 7 in which
two keywords "OPENINGTITLE" and "TITLE" are
extracted from the same string of the non-structured
document. The keyword/text model generated by
selecting the "TITLE" fails in the parsing because the
first line in the modified portion of the modified DTD
stipulates that the "OPENINGTITLE" can appear at the top
of the "LAW" but the "TITLE" cannot appear at the top of
the "LAW". Therefore, the interim SGML document for
the keyword/text model generated by selecting the
"TITLE" is not output. On the other hand, the keyword/text
model generated by selecting the "OPENINGTITLE"
succeeds in the parsing, and the
corresponding interim SGML document is output as
shown in Fig. 15.

If there is the DTD difference data 109, the SGML
document correcting module 115 modifies the interim
SGML document 114 in accordance with the DTD difference
data. The contents of this process will be
described with reference to Fig. 16. The SGML document
correcting module 115 generates an instance
1602 of modified part in DTD, which is a partial SGML
document corresponding to the contents described in
the DTD difference data 109. In this case, a string
"#PCDATA" representing the contents of the document
structure is required to be replaced by a corresponding
string. A change module 1603 for the interim SGML
document replaces the string by another string
representative of the contents of the element having the
same name. For example, the "#PCDATA" sandwiched
between the two tags <PROMULGATIONSTATEMENT>
and </PROMULGATIONSTATEMENT> in the instance
1602 of modified part in DTD is replaced by the string
"AAPREFECTUREFLOODDEFENCESIGNALREGULATIONISTOBEPROMULGATEDASINTHEFOLLOWING"
sandwiched between the same tags in the
changes 1603 in the interim SGML document. Similarly,
the "#PCDATA" sandwiched between the two tags
<PROMULGATIONDATE> and </PROMULGATIONDATE>
is replaced by a string "SHOWA 24, OCTOBER,
6", and the "#PCDATA" sandwiched between the two
tags <ESTABLISHEDREGULATIONNO.> and
</ESTABLISHEDREGULATIONNO.> is replaced by a
string "AAPREFECTUREREGULATIONNO.78". If, as in
the case of the "#PCDATA" sandwiched between the
two tags <OFFICIALTITLE> and </OFFICIALTITLE> in
the instance 1602 of modified part in DTD, no element
having the same name is included in the
changes 1603 in the interim SGML document, a string
"NONE" is forcibly inserted.

The instance 1602 of modified part in DTD generated
by the replacement process replaces the
modified portion of the interim SGML document 114 of
Fig. 1, i.e., in the example shown in Fig. 15, the portion
sandwiched between the two tags <CHANGE> and
</CHANGE>. In this manner, the SGML document
matching DTD 106 preset for subject documents can be
generated. An example of the SGML document 116 is
shown in Fig. 17. Since the individual document structure
is directly reflected upon the SGML document, it is
not necessary as in the conventional case to convert the
document instance into the individual document structure.
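
The correction of the interim SGML document can be sketched as follows. This is only a sketch under simplifying assumptions: the documents are handled as flat strings with regular expressions rather than with an SGML parser, each element name is assumed to occur once in the changed portion, and the function names fill_instance and splice are hypothetical.

```python
import re

def fill_instance(instance, changed_part):
    """Replace each #PCDATA in the instance with the contents of the
    same-named element in the changed portion, or "NONE" if absent."""
    def content_of(name):
        m = re.search(
            rf"<{re.escape(name)}>(.*?)</{re.escape(name)}>", changed_part, re.S
        )
        return m.group(1) if m else "NONE"

    return re.sub(
        r"<([^/<>]+)>#PCDATA</\1>",
        lambda m: f"<{m.group(1)}>{content_of(m.group(1))}</{m.group(1)}>",
        instance,
    )

def splice(interim, filled):
    """Substitute the filled instance for the <CHANGE>...</CHANGE> portion."""
    return re.sub(r"<CHANGE>.*?</CHANGE>", lambda _m: filled, interim, flags=re.S)
```

A callable is passed as the replacement in `splice` so that any backslashes in the document text are not interpreted as escape sequences.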

Programs realizing the first embodiment may be 
stored in a storage device such as a hard disk, a floppy 
disk, and an optical disk. 

According to the first embodiment described above,
the parsing rule 111 used for the document structure
analysis is directly generated from the document structure
definition set for subject documents. It is therefore
possible to reduce labor required for the generation of a
rule. Since the document instance is generated through
parsing in accordance with the document structure
described in the document structure definition of each
document, it is not necessary to convert the document
instance obtained through parsing, from the format
matching the common document structure into the format
matching the individual document structure.

Next, the second embodiment will be described.
This embodiment pertains to a method of supporting the
generation of the keyword extraction rule 103 by using the
modified DTD and given layout information.

Similar to the first embodiment, also in this second
embodiment, an SGML format is adopted as an example
of the structured document format, and as the document
structure definition, a DTD is used which is a
document type definition for SGML set for subject
documents.

Fig. 38 is a diagram showing the hardware structure
of a keyword extraction rule generating system of
the second embodiment. An input/display device 3910
receives an input entered by a user and displays
information about layout, a generated keyword extraction
rule, or the like. The input/display device 3910 is
constituted by a display, a keyboard, a mouse, or the
like. An external repository unit 3920 stores a variety of
data for keyword extraction rule generation. This unit
3920 is realized by a hard disk or the like and constituted
by a modified DTD repository 3921, a layout definition
repository 3922, a string-corresponding element
information repository 3923, a layout information repository
3924, and a keyword extraction rule repository
3925. A control unit 3930 controls each device constituting
the system, processes information for keyword
extraction rule generation, and is constituted by a controller
3931, an internal memory 3932, and a keyword extraction
rule generating module 3933. The controller 3931
reads data stored in the modified DTD repository 3921
and layout definition repository 3922, develops it on the
internal memory 3932, executes processes of the keyword
extraction rule generating module 3933 on the
internal memory 3932 by using the developed data, and
stores the generated string-corresponding element
information and layout information respectively in the
string-corresponding element information repository
3923 and layout information repository 3924. The processes
to be executed include a process 3934 of extracting
document structure information and a process 3935
of extracting layout information. A process 3936 of
generating a keyword extraction rule notifies an operator via
the input/display device 3910 of the string-corresponding
element information stored in the string-corresponding
element information repository 3923 and the layout
information stored in the layout information repository
3924, and receives, if necessary, supplementary information
from the operator via the input/display device
3910. The process 3934 of extracting document structure
information, the process 3935 of extracting layout
information, and the process 3936 of generating a keyword
extraction rule can be described by known programming
languages.

Next, the outline of processes of the second 
embodiment will be described. 

Fig. 21 is a block diagram showing a flow of the key- 
word extraction rule generating system. Reference 
numeral 2201 represents a modified DTD (same as 
DTD 108 shown in Fig. 1) obtained by modifying the 
document structure definition set for subject documents 
so as to match an input non-structured document. The 
modified DTD 2201 defines elements of the non-struc- 
tured document and the relationship between elements. 
A document structure information extracting module 
2202 refers to the modified DTD 2201 and generates 
string-corresponding element information 2203 describ- 
ing elements in direct correspondence with a string 
(hereinafter called a "string-corresponding element") 
and a contiguity relationship between elements. 

Reference numeral 2204 represents a layout definition
set for subject documents which defines with what
layout each element is output. A layout information
extracting module 2205 refers to the layout definition
2204 and extracts as many as possible of the items
necessary for generating a keyword extraction rule from the layout
used for outputting each element and from the information
of an output string. Each item itself is hereinafter
called a "required item", and the information extracted
for each item is called a "required item content". Layout
information 2206 describes the required item content for
each string-corresponding element.

A keyword extraction rule generating module 2207 
informs via an input/display device 2211 an operator of 
the required item content for each string-corresponding 
element in the layout information 2206. This module 
2207 receives information entered by the operator, 
modifies the required item content, and generates a 
keyword extraction rule 2212 in accordance with the 
modified required item content. 

The process by the keyword extraction rule generating
module 2207 will be described in more detail.
A keyword information indicator module 2208 informs 
the operator of the name of a string-corresponding ele- 
ment described in the string-corresponding element 
information 2203. If a string-corresponding element is 
set as a keyword-corresponding element and given a 
format condition, this format condition is also displayed 
together with the string-corresponding element. 

A supplementary information editing module 2209 
sets the format condition of each string-corresponding 
element. The supplementary information editing mod- 
ule 2209 refers to the layout information 2206 and dis- 
plays the required item content of the string- 
corresponding element selected by the operator. If the 
displayed required item content is different from the lay- 
out and strings of the non-structured document, the 
operator corrects it. The content of the required item is
given by the operator if it cannot be extracted by the
layout information extracting module 2205. In this manner,
all the required item contents are edited so that they 
match the layout and strings of the non-structured doc- 
ument. After all the required items are edited, the sup- 
plementary information editing module 2209 generates 
the format condition used for keyword extraction by 
using the required item contents. By using the layout 
condition as a return argument, the process is passed 
to the keyword information indicator module 2208. 

The keyword information indicator module 2208
sets as the keyword-corresponding element the
string-corresponding element whose format condition was
generated by the supplementary information editing module
2209, and displays the layout condition together with the
element name.

With the above processes, each keyword-corresponding
element is determined. A contiguous element
checking module 2210 inspects at a certain timing
whether an aggregation of keyword-corresponding elements
satisfies the restriction condition that non-keywords
should not be contiguous. The contiguous
element checking module 2210 refers to the contiguity
relationship between string-corresponding elements
described in the string-corresponding element information
2203, and inspects whether string-corresponding
elements other than the keyword-corresponding elements
(hereinafter called "non-keyword-corresponding
elements") are contiguous. If there is a possibility that
two non-keyword-corresponding elements are contiguous,
the operator generates the layout condition of one
of the two elements and sets it as the keyword-corresponding
element. Conversely, if there is no possibility
that non-keyword-corresponding elements are contiguous,
the keyword-corresponding elements set so far are
sufficient. At this time, an aggregation of combinations
of the name of each keyword-corresponding element
and its format condition is used as the keyword
extraction rule 2212.
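
The inspection by the contiguous element checking module can be sketched as follows. This is a minimal sketch: the contiguity relationship is assumed to be available as a list of (preceding, succeeding) pairs of string-corresponding elements, and the function name is hypothetical.

```python
def contiguous_non_keywords(adjacent_pairs, keyword_elements):
    """Return the pairs in which both elements are non-keyword-corresponding."""
    return [
        (a, b)
        for a, b in adjacent_pairs
        if a not in keyword_elements and b not in keyword_elements
    ]
```

An empty result corresponds to the case where the keyword-corresponding elements set so far are sufficient; otherwise the operator would set one element of each returned pair as a keyword-corresponding element.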

The outline process of the keyword extraction rule 
generating system has been described above. Next, the 
details of each process executed by the system shown 
in Fig. 21 will be described. 

The document structure information extracting
module 2202 refers to the modified DTD 2201 such as
shown in Fig. 10, extracts each string-corresponding
element and contiguity possibility information between
string-corresponding elements, and outputs them as the
string-corresponding element information 2203.

The string-corresponding element is an element
having "#PCDATA" representative of a string of the
document type definition (modified DTD) as a constituent of
the model group. Fig. 22 shows the string-corresponding
elements of the modified DTD shown in Fig. 10. In
the example shown in Fig. 22, extracted as the
string-corresponding elements are the elements
"OPENINGTITLE", "PROMULGATIONDATE",
"ESTABLISHEDREGULATIONNO.", "PROMULGATIONSTATEMENT",
"TITLE", "ARTICLENO.", "ARTICLESTATEMENT",
"PARAGRAPHNO.", and "PARAGRAPHSTATEMENT".

The document structure information extracting
module 2202 checks a possibility of contiguous
string-corresponding elements. The following two specific
processes are performed.

(1) An aggregation of string-corresponding elements
at the start and end of each element is
obtained. For example, in the structured document
shown in Fig. 15, at the start of the element
"PROMULGATION" (1501 to 1506), the string-corresponding
element "PROMULGATIONDATE"
(1502 to 1503) appears, and at the end of the element
"PROMULGATION", the string-corresponding
element "PROMULGATIONSTATEMENT" (1504 to
1505) appears. In this process, the elements capable
of appearing at the start and end of each element
are derived from the modified DTD 2201 such
as shown in Fig. 10.

(2) A combination of contiguous elements in the
model group of the modified DTD is obtained.
There is a contiguity possibility of each combination
between the string-corresponding elements capable
of appearing at the end of the preceding element
and at the start of the succeeding element.

In this embodiment, in order to facilitate the execution
of these two processes, the modified DTD such as
shown in Fig. 10 is converted to have notation of BNF
(Backus-Naur Form). This conversion procedure conforms
with the rule conversion regulation 112 (Fig. 12)
and is generally the same as the procedure of converting
the modified DTD 108 into the interim yacc rule 908.
However, in this embodiment, which element is determined
as a keyword is not known. Therefore, the
description "#PCDATA" of the modified DTD is not
converted into the description of [#KEY "ARTICLENO."] or
[#TEXT]. Only in this point, this embodiment differs from
the rule conversion process 906.

Fig. 23 shows an example of the modified DTD 
expressed by BNF notation. Also in this embodiment, a 
rule described in BNF notation and obtained by convert- 
ing the definition of each element of the modified DTD is 
called a "production rule". The right side of each pro- 
duction rule, in this embodiment, is called a "content 
model" of the left side element. 

The procedure of obtaining from the modified DTD 
expressed by BNF notation an aggregation of string- 
corresponding elements at the start and end of each 
element, will be described. The algorithm of this proce- 
dure is shown in Fig. 24. The procedure starting from A 
in Fig. 24 uses as an input argument an element, and as 
a return argument an aggregation of string-correspond- 
ing elements capable of appearing at the start of the 
element, and contains a recursive call. The variables 
mg and elem used in this procedure are local variables 
newly generated each time the procedure advances to 
A. First[xx] is a global variable representative of an
aggregation of string-corresponding elements capable
of appearing at the start of the element xx.

In order to obtain an aggregation of string-corre- 
sponding elements capable of appearing at the start of 
each element, the procedure A is executed by using the 
element as the argument (nt in Fig. 24). 

In the procedure A, First[nt], which represents an
aggregation of string-corresponding elements capable of
appearing at the start of nt, is set to an empty
aggregation (2501). In the content model of nt, of
the element groups partitioned by an OR-connector "|",
the first element group is substituted into the variable
mg (2502). If the OR-connector does not exist, the
whole of the content model is substituted into the variable
mg. The first element of mg is substituted into the
variable elem (2503). Next, it is checked whether elem
is a string-corresponding element (2504). If elem is a
string-corresponding element, elem is added to First[nt]
(2505) and the flow advances to step 2509, whereas if
not, the content of First[elem] is added to First[nt] (2508)
if First[elem] has been set (2506) and the flow advances
to step 2509. If First[elem] is not set at step 2506, elem
is used as the argument and the procedure A is
recursively executed (2507). The return argument, i.e., the
content of First[elem], is added to First[nt] and the flow
advances to step 2509.

At step 2509, it is checked from the content model of nt
whether mg is the last element group partitioned by the
OR-connector. If not, the next element group is
substituted into the variable mg (2510) and the flow
returns to step 2503. If mg is the last element group,
First[nt] is used as the return argument and the
processing is passed back to the procedure which called
this procedure A (2511).

The procedure shown in Fig. 24 is performed until
First[nt] is set for all elements. In this manner, an
aggregation of string-corresponding elements capable of
appearing at the start of each element can be obtained.
An aggregation Last[] of string-corresponding elements
capable of appearing at the end of each element can be
obtained in a similar manner to the procedure shown in
Fig. 24 by making the following two replacements.

(a) First[xxx] in Fig. 24 is replaced by Last[xxx].

(b) The first element at step 2503 is replaced by the
last element.
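
As a rough illustration only, the procedure of Fig. 24 and
its Last[] variant might be sketched in Python as follows.
The grammar is a hypothetical fragment that merely borrows
element names from this description; it is not the content
of Fig. 23. The names GRAMMAR, is_string_element, edge_set
and compute_sets are assumptions introduced for the sketch.
Each content model is written as a list of element groups
(the alternatives separated by the OR-connector), and each
group as a list of elements.

```python
# Hypothetical grammar fragment: element -> list of element groups.
# An element without a production rule here is treated as a
# string-corresponding element (an assumption of this sketch).
GRAMMAR = {
    "CHANGE": [["OPENINGTITLE", "PROMULGATION", "TITLE"]],
    "PROMULGATION": [
        ["PROMULGATIONDATE", "PROMULGATIONSTATEMENT"],
        ["PROMULGATIONDATE", "ESTABLISHEDREGULATIONNO."],
    ],
}


def is_string_element(elem):
    return elem not in GRAMMAR


def edge_set(nt, cache, index):
    """Procedure A. index=0 yields First[nt], index=-1 yields Last[nt]."""
    cache[nt] = set()                                   # step 2501
    for mg in GRAMMAR[nt]:                              # steps 2502, 2509, 2510
        elem = mg[index]                                # step 2503
        if is_string_element(elem):                     # step 2504
            cache[nt].add(elem)                         # step 2505
        elif elem in cache:                             # step 2506
            cache[nt] |= cache[elem]                    # step 2508
        else:
            cache[nt] |= edge_set(elem, cache, index)   # step 2507
    return cache[nt]                                    # step 2511


def compute_sets(index):
    """Run procedure A until the set is defined for every element."""
    cache = {}
    for nt in GRAMMAR:
        if nt not in cache:
            edge_set(nt, cache, index)
    # A string-corresponding element starts and ends with itself.
    for groups in GRAMMAR.values():
        for mg in groups:
            for elem in mg:
                if is_string_element(elem):
                    cache[elem] = {elem}
    return cache


FIRST = compute_sets(0)    # replacement (b) not applied: first element
LAST = compute_sets(-1)    # replacements (a) and (b): last element
print(sorted(LAST["PROMULGATION"]))
```

Like the procedure in the text, the sketch looks only at
the first (or last) element of each element group, so it
does not account for optional or repeated elements.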

Fig. 25 shows First[] and Last[], the aggregations of
string-corresponding elements capable of appearing at
the start and end of each element of the modified DTD
shown in Fig. 10.

With the above procedures, it becomes possible to obtain
the aggregation First[] of string-corresponding elements
capable of appearing at the start of each element and
the aggregation Last[] of string-corresponding elements
capable of appearing at the end of each element.

Next obtained are the combinations of contiguous
elements in the content models of the document structure
definition. For each such combination, a component of
Last[] of the preceding element may be contiguous with a
component of First[] of the succeeding element. An
example of this process is illustrated in Fig. 26, in
which the production rule "CHANGE : OPENINGTITLE
PROMULGATION TITLE" 2402 shown in Fig. 23 is processed.
In this production rule of the content model of the
element "LAW", the elements "OPENINGTITLE" and
"PROMULGATION" are contiguous and the elements
"PROMULGATION" and "TITLE" are contiguous (2701).
Therefore, an element in First[PROMULGATION] can be
backward contiguous with an element in
Last[OPENINGTITLE] (2702). Namely, the
string-corresponding element "PROMULGATIONDATE" can be
backward contiguous with the string-corresponding
element "OPENINGTITLE" (2704). An element in First[TITLE]
can be backward contiguous with an element in
Last[PROMULGATION] (2703). Namely, the
string-corresponding element "TITLE" can be backward
contiguous with both the string-corresponding elements
"PROMULGATIONSTATEMENT" and "ESTABLISHEDREGULATIONNO."
(2705). This process is applied to all production rules
in the document structure definition expressed in BNF
notation. In this way, an aggregation of all pairs of
string-corresponding elements capable of being backward
contiguous can be obtained, and this aggregation is the
string-corresponding element information (2203 in
Fig. 21). An example of the string-corresponding element
information 2203 is shown in Fig. 27.
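
The pairing of Last[] and First[] described above can be
sketched in Python roughly as follows. The function name
contiguity_pairs and the dictionary layout are assumptions
made for illustration. The entries for First[PROMULGATION],
Last[OPENINGTITLE], First[TITLE] and Last[PROMULGATION]
follow the example in the text; the remaining entries are
assumed.

```python
# First[] and Last[] for the three elements of the production rule
# "CHANGE : OPENINGTITLE PROMULGATION TITLE".
FIRST = {
    "OPENINGTITLE": {"OPENINGTITLE"},        # assumed
    "PROMULGATION": {"PROMULGATIONDATE"},
    "TITLE": {"TITLE"},
}
LAST = {
    "OPENINGTITLE": {"OPENINGTITLE"},
    "PROMULGATION": {"PROMULGATIONSTATEMENT", "ESTABLISHEDREGULATIONNO."},
    "TITLE": {"TITLE"},                      # assumed
}


def contiguity_pairs(content_model, first, last):
    """Map each string-corresponding element to the elements that can
    be backward contiguous with it within one content model."""
    following = {}
    # Walk over each pair of neighbouring elements in the content model.
    for prev, nxt in zip(content_model, content_model[1:]):
        for tail in last[prev]:
            following.setdefault(tail, set()).update(first[nxt])
    return following


pairs = contiguity_pairs(["OPENINGTITLE", "PROMULGATION", "TITLE"], FIRST, LAST)
for element in sorted(pairs):
    print(element, "->", sorted(pairs[element]))
```

Applying the same function to every production rule and
merging the results would give a table playing the role of
the string-corresponding element information 2203.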

With the procedure described with reference to the
drawings up to Fig. 26, the document structure
information extracting module 2202 can generate the
string-corresponding element information 2203.

Next, the process of the layout information extract- 
ing module 2205 shown in Fig. 21 for extracting the lay- 
out information 2206 from the layout definition 2204 will 
be described. 

The layout definition 2204 is set for subject documents
and defines with what layout each element is output.
Fig. 28 shows part of an example of the layout definition
prepared for structured documents conforming with the
document type definition (DTD). Reference numeral 2901
indicates that reference numerals 2901 to 2911 represent
the layout definition of the element "TITLE". A [font
name] 2902 indicates that the font name used for
outputting "TITLE" is Gothic, and a [font size] 2903
indicates that the font size is 12 pt (point), a length
unit where 1 pt ≈ 1/72 inch. A [character pitch] 2904
indicates that the character pitch of "TITLE" is 14 pt.
An [offset 1] 2905 and an [offset 2] 2906 indicate what
minimum spaces from the right and left sides of the
region where a document is output are reserved for
outputting the content of "TITLE". A [first-line
displacement] 2907 indicates the difference from the
[offset 1] of the offset of the first line, which often
takes a different offset from other lines. A [connection
with previous element] 2908 indicates which string is
output after the element just before. In this example,
after the element just before is output, "TITLE" is
output on a new line after a line feed. A [string
information] 2909 describes which string is output. In
this example, the string CONTENT corresponding to
"TITLE", i.e., the string between the tag <TITLE> and the
tag </TITLE>, is output. A [placement] 2910 indicates how
strings are placed within the area defined by the
[offset 1] and [offset 2]. This [placement] 2910 takes
four values "start", "end", "center", and "justify",
corresponding to left alignment, right alignment,
centering, and equal space. In this example, the string
of "TITLE" is output through centering.

Such layout definitions are essentially used for
outputting a structured document and are not used for
expressing the layout of a non-structured document.
However, for a document having a regular layout such as
legal documents, the layout definition is often
determined in accordance with the layout regularity.
Most of the layout and string information in the layout
definition of such a document can be used for extracting
keywords from the non-structured document.

The layout information extracting module 2205 refers to
the layout definition 2204 and extracts, from the layout
and string information used for outputting each element,
as many as possible of the items necessary for
extracting a keyword. As described earlier, such an item
is called a "required item", and the information
extracted for each item is called a "required item
content".

Fig. 29 shows an example of required items for each
keyword when the keyword rule shown in Fig. 5 is
generated. An [element name] 3001 is the name of a
subject string-corresponding element and takes a string
value. A [left-hand space] 3002 and a [right-hand space]
3003 indicate the conditions of what minimum character
spaces from the right and left sides of the region where
a document is output are reserved for outputting the
string of the element. A [first-line indent] 3004
indicates what character spaces at the left side are
reserved at the first line, which often takes a different
offset from other lines. A [string condition] 3005
indicates what string describes the keyword. An
[arrangement] 3006 indicates how keywords are arranged in
the region defined by the [left-hand space] 3002 and
[right-hand space] 3003. This [arrangement] 3006 takes
four values "right justify", "left justify", "centering",
and "equal space". A [previous string] 3007 and a [next
string] 3008 indicate the strings sandwiched between the
subject keyword and the string-corresponding elements
appearing before and after it.

The layout information extracting module 2205 refers to
the layout definition 2204 and extracts as much as
possible of the information of the required items shown
in Fig. 29, i.e., the required item contents. Fig. 30
illustrates an example of a process of extracting the
required item contents from the layout definition shown
in Fig. 28.

In order to extract the required item content of a 
string-corresponding element, the definition of the 
string-corresponding element in the layout definition is 
used. For example, the required item for the "ARTI- 
CLENO." is extracted from the definitions 2912 to 2922 
of the "ARTICLENO." shown in Fig. 28. 

The required items [left-hand space] and [right-hand
space] indicate the same contents as the [offset 1] and
[offset 2] of the layout definition. Therefore, only the
unit of length is changed from pt to the number of
characters. Specifically, the values of the [offset 1]
and [offset 2] are divided by the value of the
[character pitch] (3101 and 3102). The required item
[first-line indent] has as its content the sum of the
[offset 1] and the [first-line displacement] in the
layout definition, divided by the [character pitch]
(3103). The content of the required item [string
condition] is generated by referring to the [string
information] in the layout definition (3104). However,
in the example shown in Fig. 28, the [string information]
is "CONTENT" for all elements, so that the string in the
document instance itself is output and specific
information on the string cannot be obtained from the
layout definition. Since the required item [arrangement]
represents the same concept as the [placement] in the
layout definition, its values are converted in
accordance with the rules 3105. The content of the
[connection with previous element] is substituted into
the content of the required item [previous string]
(3106).
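
The conversions 3101 to 3106 might look roughly like the
following Python sketch. The function name
extract_required_items, the dictionary keys, and the use of
integer division are assumptions made for illustration. The
placement-to-arrangement table is inferred from the value
lists given in this description rather than copied from the
rules 3105, and the sample values are one reading of the
"ARTICLENO." definition in Fig. 28.

```python
# Assumed mapping between [placement] and [arrangement] values.
PLACEMENT_TO_ARRANGEMENT = {
    "start": "left justify",
    "end": "right justify",
    "center": "centering",
    "justify": "equal space",
}


def extract_required_items(name, layout):
    """Derive required item contents from one layout definition entry.
    Lengths in the layout definition are in pt; the required items
    count characters, so each length is divided by the character pitch."""
    pitch = layout["character pitch"]
    string_info = layout["string information"]
    return {
        "element name": name,
        "left-hand space": layout["offset 1"] // pitch,               # 3101
        "right-hand space": layout["offset 2"] // pitch,              # 3102
        "first-line indent":
            (layout["offset 1"] + layout["first-line displacement"]) // pitch,  # 3103
        # "CONTENT" only says that the instance text is output, so no
        # specific string condition can be derived from it (3104).
        "string condition": None if string_info == "CONTENT" else string_info,
        "arrangement": PLACEMENT_TO_ARRANGEMENT[layout["placement"]],  # 3105
        "previous string": layout["connection with previous element"],  # 3106
    }


# Values read from the "ARTICLENO." entry of Fig. 28 (lengths in pt).
ARTICLENO_LAYOUT = {
    "character pitch": 12,
    "offset 1": 12,
    "offset 2": 0,
    "first-line displacement": 0,
    "connection with previous element": "\n",
    "string information": "CONTENT",
    "placement": "start",
}

items = extract_required_items("ARTICLENO.", ARTICLENO_LAYOUT)
for key, value in items.items():
    print(f"{key}: {value!r}")
```

The [string condition] comes out empty here, which matches
the remark that it has to be supplied by the operator.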

The content of the required item [next string] is
obtained by using the string-corresponding element
information and the [connection with previous element]
of other elements in the layout definition (3107).
Specifically, first the string-corresponding elements
(hereinafter called "next elements") backward contiguous
with the subject string-corresponding element are
obtained by using the string-corresponding element
information. Next, the [connection with previous
element] is checked for all next elements, and if the
contents for all next elements are the same, this
content is set as the content of the [next string] of
the subject element. If there is a next element having a
different content of the [connection with previous
element], the content of the [next string] is not set.
For example, from the string-corresponding element
information shown in Fig. 27 at 2806, it can be known
that the next element of "ARTICLENO." is only
"ARTICLESTATEMENT". The content of the [next string] of
"ARTICLENO." is therefore " ", the [connection with
previous element] of "ARTICLESTATEMENT".
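
A minimal Python sketch of step 3107 follows, assuming the
string-corresponding element information is held as a
mapping from each element to its next elements. The
function name next_string and the elements "X", "A" and "B"
are hypothetical; only the "ARTICLENO." entries come from
the example in the text and Fig. 28.

```python
def next_string(element, following, connection_with_previous):
    """Return the [next string] of element, or None when its next
    elements do not share one [connection with previous element]."""
    next_elements = following.get(element, set())
    contents = {connection_with_previous[e] for e in next_elements}
    # The content is set only if all next elements agree on it.
    return contents.pop() if len(contents) == 1 else None


# "ARTICLENO." is followed only by "ARTICLESTATEMENT".
following = {
    "ARTICLENO.": {"ARTICLESTATEMENT"},
    "X": {"A", "B"},                 # hypothetical element with two next elements
}
connection_with_previous = {
    "ARTICLESTATEMENT": " ",
    "A": " ",                        # hypothetical
    "B": "\n",                       # hypothetical
}

print(repr(next_string("ARTICLENO.", following, connection_with_previous)))
print(repr(next_string("X", following, connection_with_previous)))
```

An element with no next element at all is also left unset
in this sketch, which the text does not address.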

The above processes are executed for all string- 
corresponding elements to generate the layout informa- 
tion 2206 shown in Fig. 21. 

The keyword extraction rule generating module 2207 shown
in Fig. 21 informs an operator, via the input/output
device 2211, of the string-corresponding element
information 2203 and the layout information 2206. This
module 2207 receives supplementary information from the
operator to add to and modify the required item contents
and generates the keyword extraction rule 2212. A
specific process of the keyword extraction rule
generating module 2207 will be described.

The keyword information indicator module 2208 informs
the operator of the string-corresponding element names
and of which string-corresponding elements are set as
keyword-corresponding elements at that point in time. If
the operator instructs to set a particular
string-corresponding element as a keyword-corresponding
element, the keyword information indicator module 2208
activates the supplementary information input module
2209, which supplements the required item content of the
string-corresponding element. If the operator instructs
to inspect whether the keyword-corresponding elements
set at that point in time satisfy the restriction
condition that non-keywords should not be contiguous,
the contiguous element checking module 2210 is
activated.

Fig. 31 shows an example of an interface for the keyword
information indicator module 2208 to display information
on the input/output device 2211 for the operator, and
Fig. 32 is its process flow. The operation of the keyword
information indicator module 2208 will be described with
reference to Figs. 31 and 32. Upon activation, the
keyword information indicator module 2208 reads the
string-corresponding element information 2203 and
obtains the name of each string-corresponding element
(3301). Reference numeral 3201 represents a keyword
information window which is constituted by an element
name display area 3202 for displaying the names of all
string-corresponding elements and a format condition
display area 3203 for displaying the format condition of
each string-corresponding element set as a
keyword-corresponding element. At step 3302, the
string-corresponding element names and the format
conditions of the elements set as keyword-corresponding
elements at that point in time are displayed. At the
initial stage, the format condition is not set for any
element, so that the format condition display area 3203
displays no information. In order to give the format
condition to a string-corresponding element and set this
element as a keyword-corresponding element, the operator
first double-clicks the element name in the element name
display area 3202 with a mouse to thereby activate the
supplementary information editing module (2209 in
Fig. 21) (3304). The detailed operation of the
supplementary information editing module 2209 will be
given later. The string-corresponding element name is
passed to the supplementary information editing module
2209, and its format condition is received as the return
argument. The string-corresponding element designated by
the operator is set as a keyword-corresponding element
(3305) and its format condition is displayed in the
format condition display area 3203 (3302). The example in
Fig. 31 shows the display of the interface at a certain
point in time. At this point, the format conditions are
given to the two string-corresponding elements "TITLE"
3206 and "PARAGRAPHNO." 3207, which means that these two
string-corresponding elements are set as
keyword-corresponding elements.

Reference numeral 3204 represents a button for checking
contiguous elements. When this button 3204 is clicked,
the contiguous element checking module (2210 in Fig. 21)
is activated, which inspects whether the aggregation of
keyword-corresponding elements set at that point in time
satisfies the restriction condition that non-keywords
should not be contiguous (3306). The operation of the
contiguous element checking module 2210 will be
described later. If the inspection judges that the
keyword-corresponding elements satisfying the
restriction condition are set, the operator clicks an
exit button to instruct termination of the process of
the keyword information indicator module 2208. The
keyword information indicator module 2208 outputs the
keyword-corresponding element names and their format
conditions as the keyword extraction rule (2212 in
Fig. 21) and terminates the process (3307). The processes
of the keyword information indicator module 2208 have
been described above.

Fig. 33 shows an example of an interface of the
supplementary information editing module 2209 activated
when the element name is double-clicked during the
operation of the keyword information indicator module
2208, and Fig. 34 shows the process flow. The
supplementary information editing module 2209 reads the
name of the string-corresponding element to be set as
the keyword-corresponding element whose format condition
is to be set, the name being passed from the keyword
information indicator module 2208 (3501), and reads the
required item content of the element from the layout
information (2206 in Fig. 21) (3502). The required item
content is displayed on a required item editor 3401
(3503). The required item editor 3401 consists of
windows in which the display content can be edited. If
the display content is different from the description
format of the non-structured document, the operator
changes its content. Since a required item content
(e.g., the [string condition] in the extraction example
shown in Figs. 30 and 31) which cannot be extracted by
the layout information extracting module 2205 is not
displayed on the required item editor, the operator
enters the required item content into the required item
editor (3504 to 3503). An example after the [string
condition] is entered is shown in Fig. 30 under the title
of "after entering string condition".

After the required item contents are edited and all the
required item contents match the description format of
the non-structured document, the operator clicks an exit
button 3402 to instruct the termination of the processes
of the supplementary information editing module 2209.
The supplementary information editing module 2209
generates the format condition from the edited required
item contents of the string-corresponding element set as
the keyword-corresponding element (3506), and passes the
format condition as the return argument to the keyword
information indicator module 2208 (3507). The process
flow of generating the format condition from the
required item content is shown in Fig. 35. In Fig. 35, the
steps surrounded by a broken line show, as an example,
the conversion of the required item content of
"ARTICLENO." shown under the title of "after entering
string condition" into the format condition.

First, the content (e.g., "ARTICLE"NUM1) of the required
item [string condition] is substituted into the format
condition, and it is checked whether the content of the
required item [previous string] is line feed (3601). If
line feed, the flow advances to step 3603, whereas if
not, the format condition is surrounded by "[" and "]",
and V and the content of the [previous string] are added
just before it (3602). In this case, a blank is
converted into SPC [integer]. Next, at step 3603 it is
checked whether the content of the required item [next
string] is line feed. If line feed, T is added to the end
of the format condition (3605) and the flow advances to
step 3606, whereas if not, the format condition is
surrounded by "[" and "]" if the format condition does
not contain "[" and "]", and the content of the [next
string] and "+" are added just after it (3604, e.g.,
["ARTICLE"NUM1] SPC1 +). At step 3606 it is checked
whether the content of the required item [arrangement]
is "centering" or not. If "centering", "C" is added to
the start of the format condition (3607) and the
generation of the format condition is terminated. If not
"centering", the flow advances to step 3608 and the
process A or B is executed depending upon the content of
the [arrangement]. If the content of the [arrangement]
is "left justify", the process A is performed, if "right
justify", the process B is performed, and if "equal
space", both the processes A and B are performed, after
which the generation of the format condition is
terminated. In the process A, "^SPCx" is added to the
start of the format condition (3609), where x is the
content of the [first-line indent] (e.g., ^SPC0
["ARTICLE"NUM1] SPC1 +). In the process B, first "SPCy$"
is added to the end of the format condition (3610),
where y is the content of the [right-hand space]. Next,
if "^" or V at the start of the format condition, "!" is
added to the start of the format condition (3611).

The supplementary information editing module 2209 passes
the obtained format condition as the return argument to
the keyword information indicator module 2208 (3507 in
Fig. 34), which in turn continues its process. The
processes of the supplementary information editing
module 2209 have been described above.

Fig. 36 shows the process flow of the contiguous element
checking module 2210 activated when the contiguity check
button is clicked during the operation of the keyword
information indicator module (2208 in Fig. 21), and
Fig. 37 shows an example of its processes. The contiguous
element checking module 2210 first reads the
keyword-corresponding elements given by the keyword
information indicator module 2208 (3701, e.g., 3801).
Next, it reads the string-corresponding element
information (2203 in Fig. 21) (3702). Then, the
non-keyword-corresponding elements are obtained as the
aggregation of all string-corresponding elements minus
the keyword-corresponding elements (3703, e.g., 3802).
At step 3704, by referring to the string-corresponding
element information, it is checked whether any
non-keyword-corresponding element is a next element of
another non-keyword-corresponding element (e.g., 3803).
If there is such a non-keyword-corresponding element,
the operator is informed of the contiguous
non-keyword-corresponding elements (3705, e.g., 3804)
and the process is terminated. If there is not, the
operator is informed to that effect (3706) and the
process is terminated. The processes of the contiguous
element checking module 2210 have been described above.
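
The check of steps 3703 and 3704 can be sketched in Python
as below. The function name contiguous_non_keywords and the
sample data are assumptions made for illustration; the
sample is not the content of Fig. 27 or Fig. 37. The
string-corresponding element information is assumed to be a
mapping from each element to its next elements.

```python
def contiguous_non_keywords(following, keyword_elements):
    """Return the pairs (element, next element) in which both are
    non-keyword-corresponding elements."""
    all_elements = set(following)
    for next_elements in following.values():
        all_elements |= next_elements
    non_keywords = all_elements - set(keyword_elements)      # step 3703
    return sorted(
        (prev, nxt)
        for prev in non_keywords
        for nxt in following.get(prev, ())
        if nxt in non_keywords                               # step 3704
    )


# Hypothetical string-corresponding element information.
following = {
    "OPENINGTITLE": {"PROMULGATIONDATE"},
    "PROMULGATIONSTATEMENT": {"TITLE"},
    "ESTABLISHEDREGULATIONNO.": {"TITLE"},
    "ARTICLENO.": {"ARTICLESTATEMENT"},
}

violations = contiguous_non_keywords(following, {"TITLE", "ARTICLENO."})
if violations:
    print("contiguous non-keywords:", violations)            # step 3705
else:
    print("no contiguous non-keywords")                      # step 3706
```

With this sample, setting "PROMULGATIONDATE" as a further
keyword-corresponding element removes the only violation.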

With this embodiment, the keyword extraction rule can be
generated. The programs described with this embodiment
may be stored in a storage medium such as a hard disk, a
floppy disk, an optical disk, or a CD-ROM.

Claims 


1. A method of generating a structured document for a
structured document generating apparatus having at least
an input/output device (1), a control unit (3), and a
repository (2) wherein a non-structured document (101)
not explicitly given the document structure and input
from said input/output device is converted into a
structured document (116) explicitly given the document
structure, in accordance with a document structure
definition defining the document structure, said method
comprising the steps of:

modifying a given first document structure definition
(106) so as to match the document structure of said
input non-structured document (101) and generate a
second document structure definition (107);

said control unit (3) generating a parsing rule (111)
used for performing a parsing process suitable for the
document structure of said second document structure
definition, by modifying marks constituting said second
document structure definition and modifying said second
document structure definition so as to make the
positional order of said marks in one-to-one
correspondence;

in accordance with said generated parsing rule,
generating a first structured document (114) from said
input non-structured document; and

in accordance with difference data between said first
document structure definition and said second document
structure definition, converting said generated first
structured document into a format matching said first
document structure definition to thereby generate a
second structured document (116).

2. A method of generating a structured document
according to claim 1, wherein said first and second
document structure definitions (106, 107) include mark
trains disposed for defining the relationship between
character strings constituting a document to be input.

3. A method of generating a structured document
according to claim 2, wherein said parsing rule (111) is
generated by embedding a process of explicitly giving
the parsed portion of document structure to be parsed,
into an interim rule generated by converting said second
document structure definition in accordance with a given
rule conversion regulation (112).

4. A method of generating a structured document
according to claim 2, wherein the mark strings of said
first and second document structure definitions (106,
107) describe the document structure, representing a
conceptional relationship between the character strings
of a document to be input, by disposing names
representing the concept of each character string.

5. A method of generating a structured document
according to claim 2, further comprising the steps of:

extracting a keyword from said non-structured document
in accordance with a predetermined rule (103) regarding
the character strings of a document to be input, and
generating a keyword/text model (104) including at least
character strings extracted as keywords and other
character strings; and

converting said keyword/text model into said first
structured document (114) by using said parsing rule.

6. A method of generating a structured document 
according to claim 5, wherein if the same character 
string in the same character region is extracted as a 
plurality of keywords, said control unit (3) selects a 
proper one from the plurality of keywords in accord- 
ance with whether the parsing process succeeds or 
fails. 

7. A method of generating a structured document 
according to claim 5, wherein said keyword is 
extracted by analyzing each character string in said 
non-structured document (101) with reference to a
keyword extraction rule (103) having a correspond- 
ence between a format condition of each character 
string and a keyword name. 

8. A method of generating a structured document 
according to claim 7, wherein said keyword extrac- 
tion rule (103) is generated, if a layout definition of 
said non-structured document is given, by modify- 
ing said layout definition in accordance with a pre- 
determined rule. 

9. A structured document generating apparatus hav- 
ing at least an input/output device (1), a control unit 
(3), and a repository (2) wherein a non-structured 
document (101) not explicitly given the document 
structure is converted into a structured document 
(116) explicitly given the document structure, com- 
prising: 

keyword extracting means (102) for extracting 
as a keyword a character string representative 
of a constituent element of the document struc- 
ture of said non-structured document, in 
accordance with layout information about lay- 
out and character string information of said 
non-structured document; 
rule generating means (110) for generating a 
rule from a second document structure defini- 
tion obtained by modifying a given first docu- 
ment definition, said rule being used for 
converting said non-structured document into 
said structured document matching said sec- 
ond document structure; and 
structured document generating means (113, 
105, 115) for generating said structured docu- 
ment by using the keyword extracted by said 
keyword extracting means and the rule gener- 
ated by said rule generating means. 

10. A method of extracting a keyword of a particular
character string representing a constituent element of
the structure of a document, comprising the steps of:

extracting (2202) document structure information from a
document structure information given in advance to a
non-structured document and generating
string-corresponding element information, the
string-corresponding element being an element of the
document structure constituting each character string of
said non-structured document;

generating layout information (2203) from a layout
definition given to said non-structured document, the
layout definition defining an output format of said
non-structured document; and

extracting the keyword (2208) in accordance with a rule
made from said string-corresponding element information
and said layout information.

11. A method of extracting a keyword according to claim
10, wherein said step of generating the
string-corresponding element information generates, as
the string-corresponding element information, a
contiguity relationship between said
string-corresponding elements.

12. A method of extracting a keyword according to claim
10, wherein said step of generating the layout
information generates as the layout information a layout
used when the constituent element of the document
structure is output, and information of each character
string.
FIG.15

<LAW>
  <CHANGE>
    <OPENINGTITLE>
      AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION
    </OPENINGTITLE>
    <PROMULGATION>
      <PROMULGATIONDATE>
        SHOWA 24, OCTOBER, 6
      </PROMULGATIONDATE>
      <ESTABLISHEDREGULATIONNO.>
        AAPREFECTURE REGULATION NO. 78
      </ESTABLISHEDREGULATIONNO.>
      <PROMULGATIONSTATEMENT>
        AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION IS
        TO BE PROMULGATED AS IN THE FOLLOWING
      </PROMULGATIONSTATEMENT>
    </PROMULGATION>
    <TITLE>
      AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION
    </TITLE>
  </CHANGE>
  <PRESENTREGULATION>
    <ARTICLE>
      <ARTICLENO.>
        ARTICLE 1
      </ARTICLENO.>
      <FIRSTPARAGRAPH>
        <FIRSTPARAGRAPHSTATEMENT>
          FLOOD DEFENCE SIGNALS STIPULATED IN ARTICLE 13,
          PARAGRAPH 1 OF THE FLOOD DEFENCE LAW
          (SHOWA 24, JUNE, LAW NO. 193) INCLUDE THE FOLLOWING.
        </FIRSTPARAGRAPHSTATEMENT>
        <PARAGRAPH>
          <PARAGRAPHNO.>
            (1)
          </PARAGRAPHNO.>
          <PARAGRAPHSTATEMENT>
            FIRST SIGNAL : FOR NOTIFYING AN ALARM WATER LEVEL
          </PARAGRAPHSTATEMENT>
        </PARAGRAPH>
FIG.17

<LAW>
  <PROMULGATION>
    <PROMULGATIONSTATEMENT>
      AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION IS
      TO BE PROMULGATED AS IN THE FOLLOWING
    </PROMULGATIONSTATEMENT>
    <PROMULGATIONDATE>
      SHOWA 24, OCTOBER, 6
    </PROMULGATIONDATE>
    <PROMULGATIONOFFICER>
      <OFFICIALTITLE>
        [ NONE ]
      </OFFICIALTITLE>
      <NAME>
        [ NONE ]
      </NAME>
    </PROMULGATIONOFFICER>
  </PROMULGATION>
  <ESTABLISHEDREGULATIONNO.>
    AAPREFECTURE REGULATION NO. 78
  </ESTABLISHEDREGULATIONNO.>
  <TITLE>
    AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION
  </TITLE>
  <PRESENTREGULATION>
    <ARTICLE>
      <ARTICLENO.>
        ARTICLE 1
      </ARTICLENO.>
      <FIRSTPARAGRAPH>
        <FIRSTPARAGRAPHSTATEMENT>
          FLOOD DEFENCE SIGNALS STIPULATED IN ARTICLE 13,
          PARAGRAPH 1 OF THE FLOOD DEFENCE LAW
          (SHOWA 24, JUNE, LAW NO. 193) INCLUDE THE FOLLOWING.
        </FIRSTPARAGRAPHSTATEMENT>
        <PARAGRAPH>
          <PARAGRAPHNO.>
            (1)
          </PARAGRAPHNO.>
          <PARAGRAPHSTATEMENT>
            FIRST SIGNAL : FOR NOTIFYING AN ALARM WATER LEVEL
          </PARAGRAPHSTATEMENT>
FIG.21

FIG.24 (flowchart of procedure A; argument element: nt;
step 2501: First[nt] = { })
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FIG.25 
First [PARAGRAPH NO.]
First [PARAGRAPH STATEMENT]
First [OPENING TITLE]
First [PROMULGATION]
First [TITLE]
First [ARTICLE NO.]
First [ARTICLE STATEMENT]
First [repO]
First [PARAGRAPH]
First [PARAGRAPH NO.]
First [PARAGRAPH STATEMENT]



= { ARTICLE NO., ARTICLE STATEMENT }
= { TITLE }
= { OPENING TITLE }
= { PROMULGATION STATEMENT, ESTABLISHED REGULATION NO. }
= { PROMULGATION STATEMENT }
= { PROMULGATION DATE }
= { ESTABLISHED REGULATION NO. }
= { PROMULGATION STATEMENT }
= { TITLE }
= { ARTICLE STATEMENT, PARAGRAPH STATEMENT }
= { ARTICLE NO. }
= { PARAGRAPH STATEMENT }
= { PARAGRAPH STATEMENT }
= { PARAGRAPH NO. }
= { PARAGRAPH STATEMENT }
FIG.28
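
The First sets listed in the preceding figures name, for each element, the elements that may appear at the start of its content. The following is a minimal sketch of how such sets could be derived from a content model, and it is not taken from the specification. The grammar `GRAMMAR`, the `(child, optional)` encoding, and the function name `first_sets` are all assumptions made for illustration. The sketch treats an element without a rule as a leaf whose First set is itself, and it honours optionality only where it is declared on an occurrence.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical content model: each element maps to an ordered sequence of
# (child element, is_optional) pairs. Elements without a rule are leaves.
Grammar = Dict[str, List[Tuple[str, bool]]]

GRAMMAR: Grammar = {
    "ARTICLE": [
        ("ARTICLE NO.", True),         # assumed optional for this example
        ("ARTICLE STATEMENT", False),
        ("PARAGRAPH", True),
    ],
    "PARAGRAPH": [
        ("PARAGRAPH NO.", False),
        ("PARAGRAPH STATEMENT", False),
    ],
}


def first_sets(grammar: Grammar) -> Dict[str, Set[str]]:
    """Compute First sets by iterating until no set grows any further."""
    first: Dict[str, Set[str]] = {name: set() for name in grammar}
    changed = True
    while changed:
        changed = False
        for name, sequence in grammar.items():
            for child, optional in sequence:
                # A leaf contributes itself; a non-leaf contributes its First set.
                candidates = first[child] if child in grammar else {child}
                if not candidates <= first[name]:
                    first[name] |= candidates
                    changed = True
                # Stop at the first child that must be present.
                if not optional:
                    break
    return first


if __name__ == "__main__":
    for element, members in first_sets(GRAMMAR).items():
        print(f"First [{element}] = {{ {', '.join(sorted(members))} }}")
```

With the assumed grammar, the optional ARTICLE NO. lets ARTICLE STATEMENT also begin an ARTICLE, which is the kind of two-member set that appears in the listing.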



2901  TITLE {
2902    [FONT NAME] GOTHIC
2903    [FONT SIZE] 12pt
2904    [CHARACTER PITCH] 14pt
2905    [OFFSET 1] 0pt
2906    [OFFSET 2] 0pt
2907    [FIRST LINE DISPLACEMENT] 0pt
2908    [CONNECTION WITH PREVIOUS ELEMENT] "¥n"
2909    [STRING INFORMATION] CONTENT
2910    [PLACEMENT] center center
2911  }
2912  ARTICLE NO. {
2913    [FONT NAME] GOTHIC
2914    [FONT SIZE] 10pt
2915    [CHARACTER PITCH] 12pt
2916    [OFFSET 1] 12pt
2917    [OFFSET 2] 0pt
2918    [FIRST LINE DISPLACEMENT] 0pt
2919    [CONNECTION WITH PREVIOUS ELEMENT] "¥n"
2920    [STRING INFORMATION] CONTENT
2921    [PLACEMENT] center start
2922  }
2923  ARTICLE STATEMENT {
2924    [FONT NAME] MING
2925    [FONT SIZE] 10pt
2926    [CHARACTER PITCH] 12pt
2927    [OFFSET 1] 12pt
2928    [OFFSET 2] 0pt
2929    [FIRST LINE DISPLACEMENT] 0pt
2930    [CONNECTION WITH PREVIOUS ELEMENT] " "
2931    [STRING INFORMATION] CONTENT
2932    [PLACEMENT] center start
2933  }
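
The listing above assigns a block of bracketed layout attributes to each element name. As a rough illustration only, the sketch below reads text of that shape into a nested dictionary. The sample `STYLE_TEXT` omits the reference numerals and is abbreviated, and the function name `parse_styles` and the dictionary layout are assumptions rather than anything the specification defines.

```python
import re
from typing import Dict, Optional

# Abbreviated sample in the shape of the listing, without reference numerals.
STYLE_TEXT = """\
TITLE {
  [FONT NAME] GOTHIC
  [FONT SIZE] 12pt
  [PLACEMENT] center center
}
ARTICLE NO. {
  [FONT NAME] GOTHIC
  [FONT SIZE] 10pt
  [OFFSET 1] 12pt
}
"""


def parse_styles(text: str) -> Dict[str, Dict[str, str]]:
    """Map each element name to its {attribute: value} block."""
    styles: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            # "ELEMENT NAME {" opens a new block.
            current = styles.setdefault(line[:-1].strip(), {})
        elif line == "}":
            current = None
        elif current is not None:
            # "[ATTRIBUTE] value" inside a block.
            match = re.fullmatch(r"\[(.+?)\]\s*(.*)", line)
            if match:
                current[match.group(1).strip()] = match.group(2).strip()
    return styles


if __name__ == "__main__":
    for element, attributes in parse_styles(STYLE_TEXT).items():
        print(element, attributes)
```

Values are kept as plain strings, so units such as `pt` are left for a later step to interpret.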
[FIRST-LINE INDENT]
FIG.37